Data Preparation and Processing

The first step was to calculate crime rates. Official crime rate statistics are not published at the LSOA (Lower layer Super Output Area) level, so we computed them ourselves from two sources: LSOA-level usual resident population data from the 2021 Census and crime records from the Metropolitan Police Service for 2021. As the output below shows, the rate is expressed as recorded crimes per 1,000 usual residents (for example, 94 crimes among 1,845 residents gives 50.9).

## # A tibble: 6 × 6
##   `LSOA Code` `LSOA Name`        Borough Total_Crime_Count Population Crime_Rate
##   <chr>       <chr>              <chr>               <dbl>      <dbl>      <dbl>
## 1 E01000006   Barking and Dagen… E09000…                94       1845       50.9
## 2 E01000007   Barking and Dagen… E09000…               507       2908      174. 
## 3 E01000008   Barking and Dagen… E09000…               224       1795      125. 
## 4 E01000009   Barking and Dagen… E09000…               298       1804      165. 
## 5 E01000011   Barking and Dagen… E09000…               111       1701       65.3
## 6 E01000012   Barking and Dagen… E09000…               142       2347       60.5
## 
## Success! The result file 'LSOA_Crime_Rate_2021_With_Names.csv' now includes LSOA names.
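
The rate calculation can be illustrated with a minimal Python sketch (not the original script, whose output appears to come from R). The function name `crime_rate_per_1000` is hypothetical; the counts and populations are copied from the table above.

```python
def crime_rate_per_1000(crime_count, population):
    """Return recorded crimes per 1,000 usual residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return crime_count / population * 1000


# (LSOA code, total crime count, population) copied from the table above.
rows = [
    ("E01000006", 94, 1845),
    ("E01000007", 507, 2908),
    ("E01000012", 142, 2347),
]

rates = {code: round(crime_rate_per_1000(c, p), 1) for code, c, p in rows}
print(rates)
```

Guarding against a zero population matters at this scale, since a small-area denominator of zero would otherwise produce a division error or an infinite rate.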

Data Processing and Integration

Following the initial data preparation, the next stage was to process and merge the analytical dataset. Apart from crime rates, all primary variables in this study were taken from the 2021 Census. We chose this source for its completeness and consistent formatting, and because drawing every predictor from a single census should reduce the risk of mismatched reference dates or geographies.

The selected variables were renamed to a consistent convention and then merged into a single working dataset (4,988 LSOAs, 38 variables). Guided by prior research and our preliminary analysis, we applied logarithmic transformations at this stage to selected skewed variables (crime rate, job density, and population density).

## 
## Merge Successful!
## Total rows: 4988
## Total variables: 38
## Confirmed inclusion of variable: pct_level4_qual (Higher Education)
## File saved as: London_LSOA_Final_Model_Data_v3.csv

Descriptive Analysis

We then produced descriptive statistics for crime rates across London boroughs. The results separate high-crime from low-crime areas and point to substantial spatial disparities within the city.

## [1] "Statistics table saved: LSOA_Descriptive_Statistics_Report.csv"

## [1] "Chart_3_Correlation_Matrix.png"
## 
## All charts generated! Please check the PNG images in the folder.

Exploratory Data Analysis

This section begins the exploratory analysis. We ran normality tests on the primary variables and computed their skewness. The tests rejected normality for every variable, and many variables were strongly right-skewed, in particular the raw crime rate (skewness 13.2) and several religious-composition shares (e.g., pct_jewish and pct_sikh). To address this, logarithmic transformations had already been applied to key variables (crime rate, job density, and population density). In addition, because skewness remains in many untransformed predictors, the subsequent correlation analysis reports both Pearson and Spearman coefficients.

## 
## --- LSOA Variable Distribution Assessment Report (Based on Skewness) ---
##                variable statistic      p.value    skewness                 dist_type            suggestion
## 1            Crime_Rate 0.3685777 3.239666e-85 13.23432014             Highly Skewed Suggest Log Transform
## 2            pct_jewish 0.2828450 4.427402e-88  7.01130205             Highly Skewed Suggest Log Transform
## 3              pct_sikh 0.3514268 8.157895e-86  6.03373874             Highly Skewed Suggest Log Transform
## 4             pct_16_19 0.7187986 3.803169e-68  5.06866237             Highly Skewed Suggest Log Transform
## 5           Job_Density 0.7569391 2.809444e-65  4.39164476             Highly Skewed Suggest Log Transform
## 6          pct_buddhist 0.7993706 1.341829e-61  3.28781997             Highly Skewed Suggest Log Transform
## 7             pct_hindu 0.6503941 1.504138e-72  3.06709844             Highly Skewed Suggest Log Transform
## 8           Pop_Density 0.8411438 2.862744e-57  2.74751781             Highly Skewed Suggest Log Transform
## 9             pct_20_24 0.7983395 1.073557e-61  2.64266145             Highly Skewed Suggest Log Transform
## 10      pct_new_migrant 0.8496540 2.845226e-56  1.80930161             Highly Skewed Suggest Log Transform
## 11           pct_muslim 0.8470371 1.388409e-56  1.63254118             Highly Skewed Suggest Log Transform
## 12        pct_born_asia 0.8495826 2.789701e-56  1.61000431             Highly Skewed Suggest Log Transform
## 13      pct_born_africa 0.9017181 7.084446e-49  1.39546005             Highly Skewed Suggest Log Transform
## 14    pct_born_americas 0.9098315 1.922297e-47  1.26977301             Highly Skewed Suggest Log Transform
## 15      log_pop_density 0.9490044 1.622746e-38 -1.04887880             Highly Skewed Suggest Log Transform
## 16      log_job_density 0.9565763 3.573233e-36 -0.92806515        Slight Skew (Good)         Keep Original
## 17       log_crime_rate 0.9642122 1.863186e-33  0.83580658        Slight Skew (Good)         Keep Original
## 18             pct_male 0.9535228 3.731902e-37  0.74908694        Slight Skew (Good)         Keep Original
## 19       pct_bad_health 0.9722962 4.973730e-30  0.72502361        Slight Skew (Good)         Keep Original
## 20      pct_overcrowded 0.9502477 3.758313e-38  0.71848301        Slight Skew (Good)         Keep Original
## 21         pct_disabled 0.9739052 2.946788e-29  0.70969342        Slight Skew (Good)         Keep Original
## 22 pct_elementary_occup 0.9556265 1.746430e-36  0.69761753        Slight Skew (Good)         Keep Original
## 23      pct_born_europe 0.9784964 7.864644e-27 -0.61895647        Slight Skew (Good)         Keep Original
## 24        pct_christian 0.9742063 4.149803e-29 -0.59418139        Slight Skew (Good)         Keep Original
## 25       pct_unemployed 0.9786538 9.674061e-27  0.58216061        Slight Skew (Good)         Keep Original
## 26   pct_private_rented 0.9797458 4.204051e-26  0.47894220 Approx Normal (Excellent)         Keep Original
## 27      pct_level4_qual 0.9645341 2.480384e-33  0.39516542 Approx Normal (Excellent)         Keep Original
## 28          pct_no_qual 0.9891000 3.436640e-19  0.19061416 Approx Normal (Excellent)         Keep Original
## 29         pct_deprived 0.9890197 2.890549e-19  0.16480399 Approx Normal (Excellent)         Keep Original
## 30      pct_no_religion 0.9869718 4.600704e-21 -0.10708486 Approx Normal (Excellent)         Keep Original
## 31 pct_hh_with_disabled 0.9963926 1.145266e-09 -0.08995187 Approx Normal (Excellent)         Keep Original
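
The skewness-based labels in the report can be sketched in Python as follows. This is an illustration under two assumptions, not the original script: the estimator is taken to be the moment-based skewness (the report does not say which estimator was used), and the cut-offs of 0.5 and 1 on absolute skewness are inferred from where the labels change in the table. The function names and the sample values are made up.

```python
import math


def skewness(values):
    """Moment-based sample skewness g1 = m3 / m2**1.5 (assumed estimator)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def classify(skew):
    """Label a skewness value using cut-offs inferred from the report."""
    size = abs(skew)
    if size > 1:
        return "Highly Skewed"
    if size > 0.5:
        return "Slight Skew"
    return "Approx Normal"


raw = [1, 1, 2, 2, 3, 4, 6, 10, 25, 80]  # made-up right-skewed values
logged = [math.log(v) for v in raw]      # natural log pulls in the long tail
print(classify(skewness(raw)), "->", classify(skewness(logged)))
```

Because the rule uses absolute skewness, an already logged variable such as log_pop_density (skewness of about -1.05) is still flagged, which explains why the report suggests a log transform for it.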

Correlation Analysis

After log-transforming the key variables, we used both Pearson and Spearman correlation tests to address the primary research question: how socio-economic variables relate to (log) crime rates. The results were visualized in a lollipop chart to allow a direct comparison of effect sizes. Interpretation relies primarily on Spearman’s rank correlation, which is more robust than Pearson’s to non-normal distributions. The results table is sorted by the absolute Spearman coefficient, and its Significant column follows the Spearman p-value.

## List of independent variables for correlation analysis:
##  [1] "log_job_density"      "pct_private_rented"   "pct_overcrowded"     
##  [4] "pct_new_migrant"      "pct_christian"        "pct_muslim"          
##  [7] "pct_hindu"            "pct_jewish"           "pct_sikh"            
## [10] "pct_buddhist"         "pct_no_religion"      "pct_male"            
## [13] "pct_unemployed"       "pct_disabled"         "pct_bad_health"      
## [16] "pct_no_qual"          "pct_level4_qual"      "pct_16_19"           
## [19] "pct_20_24"            "pct_deprived"         "pct_born_europe"     
## [22] "pct_born_africa"      "pct_born_asia"        "pct_born_americas"   
## [25] "pct_elementary_occup" "pct_hh_with_disabled" "log_pop_density"
## 
## --- Correlation Analysis Results: Variables vs [Log Crime Rate] ---
##                Variable Pearson_r Pearson_p Spearman_rho Spearman_p Significant
## 1        pct_unemployed     0.393    <2e-16        0.471     <2e-16         YES
## 2       pct_overcrowded     0.305    <2e-16        0.401     <2e-16         YES
## 3             pct_20_24     0.389    <2e-16        0.401     <2e-16         YES
## 4    pct_private_rented     0.411    <2e-16        0.379     <2e-16         YES
## 5     pct_born_americas     0.346    <2e-16        0.374     <2e-16         YES
## 6       pct_new_migrant     0.393    <2e-16        0.373     <2e-16         YES
## 7          pct_deprived     0.291    <2e-16        0.368     <2e-16         YES
## 8        pct_bad_health     0.257    <2e-16        0.333     <2e-16         YES
## 9       pct_born_africa     0.221    <2e-16        0.312     <2e-16         YES
## 10      pct_born_europe    -0.281    <2e-16       -0.302     <2e-16         YES
## 11           pct_muslim     0.201    <2e-16        0.295     <2e-16         YES
## 12 pct_elementary_occup     0.215    <2e-16        0.277     <2e-16         YES
## 13         pct_disabled     0.194    <2e-16        0.262     <2e-16         YES
## 14            pct_hindu    -0.185    <2e-16       -0.255     <2e-16         YES
## 15      log_pop_density     0.122    <2e-16        0.214     <2e-16         YES
## 16      log_job_density     0.118    <2e-16        0.199     <2e-16         YES
## 17        pct_christian    -0.153    <2e-16       -0.164     <2e-16         YES
## 18          pct_no_qual     0.104  1.52e-13        0.161     <2e-16         YES
## 19             pct_sikh    -0.053  0.000189       -0.147     <2e-16         YES
## 20         pct_buddhist     0.142    <2e-16        0.135     <2e-16         YES
## 21 pct_hh_with_disabled     0.023     0.108        0.105   8.85e-14         YES
## 22        pct_born_asia     0.088  4.03e-10        0.075   9.47e-08         YES
## 23             pct_male     0.017     0.222       -0.047   0.000862         YES
## 24            pct_16_19     0.051  0.000277        0.016      0.269          NO
## 25      pct_level4_qual     0.052  0.000241        0.014      0.339          NO
## 26      pct_no_religion     0.035    0.0147        0.011      0.452          NO
## 27           pct_jewish    -0.060     2e-05        0.008      0.591          NO
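
To show why the two coefficients can disagree on skewed or non-linear data, here is a minimal pure-Python sketch (not the original script). Spearman's rho is computed as Pearson's r on average ranks; the helper names and the toy data are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5, 6]
y = [math.exp(v) for v in x]  # monotone but strongly non-linear in x
print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```

For a perfectly monotone but curved relationship like this one, the rank-based coefficient stays at its maximum while Pearson's r falls below it, which is the behaviour that motivates leaning on Spearman for the skewed predictors.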

Education Variable Screening

Correlation analysis revealed a surprisingly weak association between educational indicators and crime rates: the Spearman coefficient was 0.161 for the share with no qualifications and only 0.014 (not significant) for the share with Level 4 or higher qualifications. This diverges from prior research and conventional wisdom, which typically posit a strong link between educational attainment and crime. Given this discrepancy, we tested the education variables separately. In a univariate regression the higher-education share had a small but statistically significant coefficient (0.002, R2 = 0.003); once log job density was controlled for, the coefficient (0.001) was no longer significant. Consequently, the variable representing higher educational qualifications was excluded from the subsequent modeling process.

## 
## LSOA Level Education vs Crime: Masking Effect Analysis
## ==============================================================================================
##                                                Dependent variable:                            
##                     --------------------------------------------------------------------------
##                                                  Crime Rate (Log)                             
##                       Low Edu (Univariate)    High Edu (Univariate)    High Edu (Controlled)  
##                               (1)                      (2)                      (3)           
## ----------------------------------------------------------------------------------------------
## Pct No Qual                 0.010***                                                          
##                             (0.001)                                                           
##                                                                                               
## Pct Level 4+                                         0.002***                  0.001          
##                                                      (0.001)                  (0.001)         
##                                                                                               
## Job Density (Log)                                                             0.085***        
##                                                                               (0.011)         
##                                                                                               
## Constant                    4.076***                 4.126***                 3.491***        
##                             (0.022)                  (0.029)                  (0.088)         
##                                                                                               
## ----------------------------------------------------------------------------------------------
## Observations                 4,988                    4,988                    4,988          
## R2                           0.011                    0.003                    0.014          
## Adjusted R2                  0.011                    0.003                    0.014          
## Residual Std. Error    0.582 (df = 4986)        0.584 (df = 4986)        0.581 (df = 4985)    
## F Statistic         54.851*** (df = 1; 4986) 13.501*** (df = 1; 4986) 36.170*** (df = 2; 4985)
## ==============================================================================================
## Note:                                                              *p<0.1; **p<0.05; ***p<0.01

## 
## Analysis complete! Please check the difference between the two plots in 'LSOA_Education_Paradox_Analysis.png'.

Linearity Assessment and Variable Transformation

Following the initial correlation screening, we used scatter plots to assess visually the linearity and strength of association for the remaining high-potential variables. Most of these predictors showed satisfactory linearity and were retained for the exploratory regression phase. However, the youth (pct_20_24) and new-migrant (pct_new_migrant) variables remained right-skewed, so we compared raw and log-transformed versions of these two indicators to see whether the transformation improved model fit.

## 
## Charts Generated:
## 1. LSOA_Scatter_Strong.png (Strong Correlation)
## 2. LSOA_Scatter_Moderate.png (Moderate/Characteristic)

Logarithmic Transformation Strategy

We then compared the raw and log-transformed forms of these two variables. The transformation reduced skewness from 2.643 to 0.786 for pct_20_24 and from 1.809 to 0.495 for pct_new_migrant, and the regression using the log forms fitted slightly better (adjusted R2 of 0.3849 against 0.3717). Note that the youth coefficient changes sign between the two specifications (-0.0413 raw, 0.0325 logged), so its interpretation is sensitive to functional form. On this basis, the log-transformed versions were adopted for the subsequent regression analysis.

## 
## --- Skewness Improvement Report ---
## 1. Youth Population (pct_20_24):
##    Raw Skewness: 2.643  ->  Log Skewness: 0.786 (Significant Improvement)
## 2. New Migrants (pct_new_migrant):
##    Raw Skewness: 1.809  ->  Log Skewness: 0.495 (Near Normal)

## 
## Variable Form Performance Comparison: Raw vs Log
## ============================================================
##                                     Dependent variable:     
##                                 ----------------------------
##                                       Crime Rate (Log)      
##                                 Raw Proportions   Log Forms 
##                                       (1)            (2)    
## ------------------------------------------------------------
## Job Density                        -0.1229***    -0.1418*** 
##                                     (0.0076)      (0.0077)  
##                                                             
## Unemployment                       0.0776***      0.0661*** 
##                                     (0.0100)      (0.0099)  
##                                                             
## Deprivation                        0.2482***      0.2274*** 
##                                     (0.0105)      (0.0102)  
##                                                             
## Youth (Raw)                        -0.0413***               
##                                     (0.0097)                
##                                                             
## New Migrants (Raw)                 0.3709***                
##                                     (0.0109)                
##                                                             
## Youth (Log)                                       0.0325*** 
##                                                   (0.0087)  
##                                                             
## New Migrants (Log)                                0.3402*** 
##                                                   (0.0095)  
##                                                             
## Constant                           4.2291***      4.2291*** 
##                                     (0.0066)      (0.0065)  
##                                                             
## ------------------------------------------------------------
## Observations                         4,988          4,988   
## R2                                   0.3723        0.3855   
## Adjusted R2                          0.3717        0.3849   
## Residual Std. Error (df = 4982)      0.4639        0.4590   
## F Statistic (df = 5; 4982)        591.0408***    625.0719***
## ============================================================
## Note:                            *p<0.1; **p<0.05; ***p<0.01
## 
## Conclusion: Please compare the Adjusted R-squared of the two models.
## 
## Typically, the Log Model (Model 2) exhibits a higher R-squared and more significant t-values, indicating that the logarithmic form better captures the true underlying patterns of the data.
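
The adjusted R-squared comparison can be checked by hand from the reported values. The sketch below uses the standard formula, adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), with n = 4,988 observations and p = 5 predictors as shown in the table; the function name is hypothetical and the inputs are the rounded R2 values from the output.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a linear model with an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# R2 values as printed in the comparison table (rounded to four decimals).
raw_adj = adjusted_r2(0.3723, 4988, 5)
log_adj = adjusted_r2(0.3855, 4988, 5)
print(round(raw_adj, 4), round(log_adj, 4))
```

With nearly 5,000 observations and only five predictors, the adjustment is tiny, so the gap between the two models comes almost entirely from the difference in raw R2.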

Finalizing the Analytical Dataset

Following the logarithmic transformations, the working dataset was updated to serve as the final analytical table.

## 
## Success! New file generated: 'London_LSOA_Final_Model_Data_v4.csv'
## Newly included variables: log_youth, log_migrant
## Total number of variables: 40

Regression Analysis and Variable Selection Strategy

Following the exploratory analysis, we moved to regression modeling. Models were first specified from theoretical frameworks and hypothesized explanatory factors, adding blocks of predictors step by step. These preliminary specifications had limited explanatory power (adjusted R2 rising from 0.154 to 0.430 across the seven models), and the VIF diagnostics flagged severe collinearity between log job density and log population density (VIF of about 72 for each). To identify more robust predictors and address this redundancy, we used the LASSO (Least Absolute Shrinkage and Selection Operator) technique for automated variable selection in the subsequent analysis.

## Current Sample Size: N = 4988
## Sample count including Westminster: 0 (These are key high-leverage points)
## 
## LSOA Crime Rate Regression Results (Including Westminster)
## ===============================================================================================
##                                                Dependent variable:                             
##                    ----------------------------------------------------------------------------
##                                                   Log Crime Rate                               
##                    Economic +Strain  +Opportunity  +Demog  +Health(Core) Alt.Poverty Full Model
##                      (1)      (2)        (3)        (4)         (5)          (6)        (7)    
## -----------------------------------------------------------------------------------------------
## Unemployment       0.140*** 0.124***   0.134***   0.074***   0.050***                 0.062*** 
##                    (0.005)  (0.006)    (0.006)    (0.006)     (0.006)                 (0.006)  
##                                                                                                
## Overcrowding                                                              0.027***             
##                                                                            (0.003)             
##                                                                                                
## Private Rented              0.005***   0.019***   0.022***   0.022***     0.017***    0.009*** 
##                             (0.002)    (0.002)    (0.002)     (0.001)      (0.002)    (0.002)  
##                                                                                                
## Job Density (Log)                                                                     0.013*** 
##                                                                                       (0.001)  
##                                                                                                
## Pop Density (Log)                      1.026***    -0.017    0.416***     0.437***    0.293*** 
##                                        (0.064)    (0.068)     (0.069)      (0.069)    (0.068)  
##                                                                                                
## Youth (Log)                           -1.138***   -0.187**   -0.643***    -0.656***  -0.526*** 
##                                        (0.070)    (0.071)     (0.072)      (0.072)    (0.071)  
##                                                                                                
## New Migrants (Log)                                0.203***   0.159***     0.165***    0.191*** 
##                                                   (0.028)     (0.027)      (0.027)    (0.027)  
##                                                                                                
## Bad Health                                        0.604***   0.659***     0.731***    0.376*** 
##                                                   (0.023)     (0.022)      (0.023)    (0.028)  
##                                                                                                
## Deprivation                                                  0.100***     0.067***    0.122*** 
##                                                               (0.005)      (0.007)    (0.005)  
##                                                                                                
## Constant           3.560*** 3.575***   5.116***   3.524***   3.677***     3.460***    3.900*** 
##                    (0.023)  (0.024)    (0.121)    (0.118)     (0.114)      (0.114)    (0.112)  
##                                                                                                
## -----------------------------------------------------------------------------------------------
## Observations        4,988    4,988      4,988      4,988       4,988        4,988      4,988   
## R2                  0.154    0.157      0.199      0.352       0.399        0.400      0.431   
## Adjusted R2         0.154    0.156      0.199      0.351       0.398        0.400      0.430   
## ===============================================================================================
## Note:                                                             *p<0.05; **p<0.01; ***p<0.001
## 
## --- Multicollinearity Diagnosis (VIF) ---
##     pct_unemployed    pct_overcrowded pct_private_rented    log_job_density 
##           2.389109           3.177242           2.878585          71.601300 
##    log_pop_density          log_youth        log_migrant     pct_bad_health 
##          71.986135           1.790695           3.664994           1.733615
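
For reference, the VIF of predictor j equals 1 / (1 - R2_j), where R2_j comes from regressing that predictor on all the others; a VIF of about 72 therefore implies that roughly 98.6% of the variable's variance is explained by the remaining predictors. An equivalent computation reads the VIFs off the diagonal of the inverse correlation matrix. The pure-Python sketch below follows that route; it is an illustration with made-up data and hypothetical helper names, not the original diagnostic code.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length columns."""
    n = len(columns[0])
    centred = []
    for col in columns:
        mean = sum(col) / n
        centred.append([v - mean for v in col])
    norms = [math.sqrt(sum(v * v for v in col)) for col in centred]
    return [[sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
             for cj, nj in zip(centred, norms)]
            for ci, ni in zip(centred, norms)]


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    size = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(size)]
           for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]


def vif(columns):
    """VIF_j is the j-th diagonal element of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Made-up predictors: b is almost a copy of a, c is only weakly related to both.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1]
c = [2.0, -1.0, 0.5, 3.0, -2.0, 1.0, 0.0, -0.5]
vifs = vif([a, b, c])
print([round(v, 1) for v in vifs])
```

The two near-duplicate columns receive very large VIFs while the third stays close to 1, mirroring the pattern seen for the two log density variables above.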


LASSO Selection and Outlier Management

In this section, we applied LASSO to implement a data-driven feature selection process. We also refined the dataset by excluding high-leverage outliers, specifically the Westminster area, whose unusual non-residential character had distorted earlier estimates; cleaning reduced the sample from 4,988 to 4,865 LSOAs. At the selected penalty (lambda of about 0.0002), LASSO retained all 21 candidate predictors, and an OLS refit on them showed severe multicollinearity, with VIF values above 300 for pct_born_europe and pct_born_asia and around 100 for the two log density variables. We therefore removed variables that contributed most to this collinearity and reorganized the remaining predictors into a new combination for the final regression analysis.

## Data cleaning complete.
## Original sample size: 4988  -> Sample size after cleaning: 4865
## Lasso matrix preparation complete. Matrix dimensions: 4865 21
## Best Lambda selected by Lasso: 0.0002038313

## 
## --- Variables Selected by Lasso and Their Coefficients ---
##                Variable          Coef
## 1           log_migrant  1.893884e-01
## 2             log_youth  1.813297e-01
## 3       log_pop_density -1.502182e-01
## 4       log_job_density -1.115536e-01
## 5        pct_bad_health  6.962884e-02
## 6          pct_deprived  2.859714e-02
## 7       pct_born_europe -2.717287e-02
## 8         pct_born_asia -2.651869e-02
## 9        pct_unemployed  2.374417e-02
## 10         pct_disabled  1.499216e-02
## 11      pct_born_africa -1.390243e-02
## 12 pct_hh_with_disabled -1.371695e-02
## 13   pct_private_rented  1.362659e-02
## 14      pct_no_religion  1.196217e-02
## 15      pct_overcrowded  9.273849e-03
## 16 pct_elementary_occup -5.329418e-03
## 17           pct_muslim  4.236854e-03
## 18          pct_no_qual  1.946132e-03
## 19        pct_christian  3.098296e-04
## 20    pct_born_americas -3.012018e-04
## 21      pct_level4_qual -8.121339e-05
## 
## 
## === Final OLS Regression Results (Using only variables selected by Lasso) ===
## 
## Call:
## lm(formula = as.formula(formula_str), data = df_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.25508 -0.25566 -0.02484  0.21303  2.43584 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           9.3919011  1.1700845   8.027 1.25e-15 ***
## log_migrant           0.1835872  0.0326657   5.620 2.01e-08 ***
## log_youth             0.1752009  0.0279149   6.276 3.77e-10 ***
## log_pop_density      -0.1467902  0.0775744  -1.892 0.058517 .  
## log_job_density      -0.1172906  0.0755337  -1.553 0.120530    
## pct_bad_health        0.0684906  0.0103224   6.635 3.60e-11 ***
## pct_deprived          0.0283102  0.0038424   7.368 2.03e-13 ***
## pct_born_europe      -0.0529002  0.0111592  -4.740 2.19e-06 ***
## pct_born_asia        -0.0519122  0.0110915  -4.680 2.94e-06 ***
## pct_unemployed        0.0252033  0.0063451   3.972 7.23e-05 ***
## pct_disabled          0.0161960  0.0077047   2.102 0.035596 *  
## pct_born_africa      -0.0390266  0.0111424  -3.503 0.000465 ***
## pct_hh_with_disabled -0.0140946  0.0028949  -4.869 1.16e-06 ***
## pct_private_rented    0.0136975  0.0008717  15.713  < 2e-16 ***
## pct_no_religion       0.0117463  0.0012337   9.522  < 2e-16 ***
## pct_overcrowded       0.0093510  0.0024333   3.843 0.000123 ***
## pct_elementary_occup -0.0055036  0.0030131  -1.827 0.067825 .  
## pct_muslim            0.0042815  0.0009974   4.293 1.80e-05 ***
## pct_no_qual           0.0010616  0.0031031   0.342 0.732287    
## pct_christian         0.0004385  0.0010183   0.431 0.666767    
## pct_born_americas    -0.0266085  0.0117499  -2.265 0.023584 *  
## pct_level4_qual      -0.0013797  0.0014663  -0.941 0.346774    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4101 on 4843 degrees of freedom
## Multiple R-squared:  0.4805, Adjusted R-squared:  0.4783 
## F-statistic: 213.3 on 21 and 4843 DF,  p-value: < 2.2e-16
## 
## 
## --- Multicollinearity Diagnosis (VIF) ---
##          log_migrant            log_youth      log_pop_density 
##             5.593808             2.182870            96.681136 
##      log_job_density       pct_bad_health         pct_deprived 
##           100.305988             7.657706            12.255740 
##      pct_born_europe        pct_born_asia       pct_unemployed 
##           362.188016           307.378687             3.144761 
##         pct_disabled      pct_born_africa pct_hh_with_disabled 
##             6.861894            53.261983             8.324533 
##   pct_private_rented      pct_no_religion      pct_overcrowded 
##             3.963041             6.251461             8.232763 
## pct_elementary_occup           pct_muslim          pct_no_qual 
##             7.365420             3.966967            11.412632 
##        pct_christian    pct_born_americas      pct_level4_qual 
##             2.875272            32.865362            11.741762
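
To make the role of the penalty concrete, the following pure-Python sketch implements LASSO by cyclic coordinate descent for the objective (1/2n) * RSS + lambda * sum(|b_j|). It is an illustration only: the original results appear to come from an R LASSO routine whose exact settings are not shown, the data are made up, and the function names are hypothetical. Predictors and response are assumed to be centred so that no intercept is needed.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return 0 inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso(columns, y, penalty, sweeps=200):
    """Cyclic coordinate descent for (1/2n) * RSS + penalty * sum(|b_j|).

    columns: list of predictor columns, assumed already centred.
    y: response, assumed already centred (so no intercept is fitted).
    """
    n = len(y)
    coefs = [0.0] * len(columns)
    residual = list(y)
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            scale = sum(v * v for v in col) / n
            # Covariance of column j with the partial residual that excludes it.
            rho = sum(v * (r + v * coefs[j]) for v, r in zip(col, residual)) / n
            new = soft_threshold(rho, penalty) / scale
            shift = new - coefs[j]
            if shift != 0.0:
                residual = [r - v * shift for r, v in zip(residual, col)]
                coefs[j] = new
    return coefs


# Made-up centred data: y depends on x1 only; x2 is irrelevant.
x1 = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
x2 = [1.0, -1.0, 1.0, -2.0, 1.0, -1.0, 1.0]
y = [2.0 * v for v in x1]

light = lasso([x1, x2], y, penalty=0.1)
heavy = lasso([x1, x2], y, penalty=100.0)
print(light, heavy)
```

A light penalty shrinks the useful coefficient only slightly and zeroes the irrelevant one, while a heavy penalty removes everything. With a penalty as small as the one selected above, very little shrinkage occurs, which is consistent with all 21 variables being retained.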


Model Optimization based on Combined Selection Strategies

Combining the LASSO results with the earlier scatter plot and correlation screening, we assembled a set of 17 predictors for this iteration, fitted on the full sample of 4,988 LSOAs. The resulting model reached an adjusted R2 of 0.468, above the 0.430 of the theory-driven full model and close to the 0.478 of the LASSO-based refit, while the largest VIF fell to about 12 (pct_deprived). We therefore used this specification as the basis for further refinement.

## 
## Call:
## lm(formula = formula_top18, data = df_final)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.31287 -0.26592 -0.02926  0.22153  2.89818 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.8200878  0.0998342  48.281  < 2e-16 ***
## pct_unemployed        0.0187671  0.0063449   2.958 0.003113 ** 
## pct_private_rented    0.0139796  0.0008487  16.472  < 2e-16 ***
## pct_20_24             0.0115680  0.0033701   3.432 0.000603 ***
## pct_new_migrant       0.0214706  0.0020588  10.429  < 2e-16 ***
## pct_born_americas     0.0290802  0.0028539  10.190  < 2e-16 ***
## pct_overcrowded       0.0108474  0.0024654   4.400 1.11e-05 ***
## pct_deprived          0.0328997  0.0038943   8.448  < 2e-16 ***
## pct_bad_health        0.0817207  0.0103929   7.863 4.56e-15 ***
## pct_born_africa       0.0009449  0.0021997   0.430 0.667520    
## pct_muslim            0.0019756  0.0008933   2.212 0.027038 *  
## pct_elementary_occup -0.0027321  0.0028282  -0.966 0.334080    
## pct_disabled          0.0109597  0.0077623   1.412 0.158039    
## log_job_density      -0.2465234  0.0098513 -25.024  < 2e-16 ***
## pct_buddhist          0.0537683  0.0097545   5.512 3.72e-08 ***
## pct_no_qual          -0.0084951  0.0024551  -3.460 0.000544 ***
## pct_born_asia        -0.0092158  0.0010388  -8.872  < 2e-16 ***
## pct_hh_with_disabled -0.0123193  0.0028644  -4.301 1.73e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4269 on 4970 degrees of freedom
## Multiple R-squared:  0.4696, Adjusted R-squared:  0.4678 
## F-statistic: 258.9 on 17 and 4970 DF,  p-value: < 2.2e-16
## 
## --- Variance Inflation Factor (VIF) Check ---
##       pct_unemployed   pct_private_rented            pct_20_24 
##             2.971488             3.681908             2.606561 
##      pct_new_migrant    pct_born_americas      pct_overcrowded 
##             5.710225             1.883874             7.968851 
##         pct_deprived       pct_bad_health      pct_born_africa 
##            12.084619             7.722217             1.934892 
##           pct_muslim pct_elementary_occup         pct_disabled 
##             3.030105             6.113533             6.847286 
##      log_job_density         pct_buddhist          pct_no_qual 
##             1.625039             1.199759             6.787904 
##        pct_born_asia pct_hh_with_disabled 
##             2.527744             8.054382

Spatial Fixed Effects Optimization

Building on the previous specification, we added borough-level fixed effects to control for unobserved spatial heterogeneity across London’s boroughs. A coefficient plot was generated to visualize these location effects. Goodness-of-fit improved modestly, with adjusted \(R^2\) rising from 0.468 to 0.493. Multicollinearity diagnostics showed a GVIF below 10 for every predictor except pct_deprived (12.72). The large raw GVIF for the borough factor (90.19) reflects its 31 degrees of freedom; its adjusted value, GVIF^(1/(2*Df)), is only 1.08.

## Number of Boroughs: 32
## 
## Regression Results with Borough Fixed Effects
## =================================================
##                           Dependent variable:    
##                       ---------------------------
##                             log_crime_rate       
## -------------------------------------------------
## pct_unemployed                 0.017***          
##                                 (0.006)          
##                                                  
## pct_private_rented             0.015***          
##                                 (0.001)          
##                                                  
## pct_20_24                       0.008**          
##                                 (0.003)          
##                                                  
## pct_new_migrant                0.021***          
##                                 (0.002)          
##                                                  
## pct_born_americas              0.023***          
##                                 (0.004)          
##                                                  
## pct_overcrowded                0.010***          
##                                 (0.003)          
##                                                  
## pct_deprived                   0.030***          
##                                 (0.004)          
##                                                  
## pct_bad_health                 0.069***          
##                                 (0.010)          
##                                                  
## pct_born_africa                 0.005*           
##                                 (0.003)          
##                                                  
## pct_muslim                      0.003**          
##                                 (0.001)          
##                                                  
## pct_elementary_occup             0.002           
##                                 (0.003)          
##                                                  
## pct_disabled                     0.007           
##                                 (0.008)          
##                                                  
## log_job_density                -0.263***         
##                                 (0.010)          
##                                                  
## pct_buddhist                   0.038***          
##                                 (0.010)          
##                                                  
## pct_no_qual                    -0.008***         
##                                 (0.003)          
##                                                  
## pct_born_asia                  -0.006***         
##                                 (0.001)          
##                                                  
## pct_hh_with_disabled           -0.010***         
##                                 (0.003)          
##                                                  
## Constant                       5.023***          
##                                 (0.110)          
##                                                  
## -------------------------------------------------
## Borough fixed effects             Yes            
## Observations                     4,988           
## R2                               0.497           
## Adjusted R2                      0.493           
## =================================================
## Note:                 *p<0.1; **p<0.05; ***p<0.01
##                              GVIF Df GVIF^(1/(2*Df))
## pct_unemployed           3.175212  1        1.781913
## pct_private_rented       4.285733  1        2.070201
## pct_20_24                2.844538  1        1.686576
## pct_new_migrant          6.490345  1        2.547616
## pct_born_americas        3.546239  1        1.883146
## pct_overcrowded          8.683568  1        2.946790
## pct_deprived            12.722013  1        3.566793
## pct_bad_health           8.146750  1        2.854251
## pct_born_africa          2.677882  1        1.636423
## pct_muslim               5.101017  1        2.258543
## pct_elementary_occup     7.784596  1        2.790089
## pct_disabled             7.019535  1        2.649441
## log_job_density          1.833992  1        1.354250
## pct_buddhist             1.333016  1        1.154563
## pct_no_qual              8.395485  1        2.897496
## pct_born_asia            4.201771  1        2.049822
## pct_hh_with_disabled     8.801903  1        2.966800
## factor(Derived_Borough) 90.188382 31        1.075312
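
The fixed-effects specification and the GVIF columns reported above can be reproduced in miniature. The following is a minimal sketch on simulated data, not the report's code: the data frame `toy`, its columns (`x`, `z`, `borough`), the 8 made-up boroughs, and the helper `gvif_by_hand()` are all hypothetical. The helper follows the generalized VIF definition of Fox and Monette using base R only, so that the three output columns (GVIF, Df, adjusted GVIF) can be traced step by step.

```r
# Simulated data (hypothetical; not the study's dataset)
set.seed(42)
n_borough <- 8
n_per     <- 200
borough   <- factor(rep(paste0("B", seq_len(n_borough)), each = n_per))

# Each borough has its own baseline, and x is correlated with that baseline
baseline <- rep(seq(-1, 1, length.out = n_borough), each = n_per)
x <- 5 + 2 * baseline + rnorm(n_borough * n_per)
z <- 0.7 * x + rnorm(n_borough * n_per)            # collinear with x
y <- 4 + 0.05 * x + 0.02 * z + baseline + rnorm(n_borough * n_per, sd = 0.4)
toy <- data.frame(y, x, z, borough)

m_pooled <- lm(y ~ x + z, data = toy)              # ignores borough baselines
m_fe     <- lm(y ~ x + z + borough, data = toy)    # borough fixed effects

# Generalized VIF per model term, computed from the coefficient correlations
gvif_by_hand <- function(model) {
  assign_idx <- attr(model.matrix(model), "assign")
  keep <- assign_idx != 0                          # drop the intercept
  R <- cov2cor(vcov(model)[keep, keep])
  assign_idx <- assign_idx[keep]
  labels <- attr(terms(model), "term.labels")
  out <- t(sapply(seq_along(labels), function(j) {
    cols <- which(assign_idx == j)                 # columns belonging to term j
    gvif <- det(R[cols, cols, drop = FALSE]) *
      det(R[-cols, -cols, drop = FALSE]) / det(R)
    c(GVIF = gvif, Df = length(cols), adj = gvif^(1 / (2 * length(cols))))
  }))
  rownames(out) <- labels
  out
}

gvif_tab <- gvif_by_hand(m_fe)
print(round(gvif_tab, 3))

coef(m_pooled)["x"]   # inflated: picks up the borough baseline
coef(m_fe)["x"]       # close to the simulated slope of 0.05
```

For a term with a single column, the GVIF equals the ordinary VIF, 1 / (1 - R^2) from regressing that predictor on all the others. For a multi-column term such as the borough factor, the `adj` column is the one to compare across terms.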

Exclusion of Ethnicity and Religion Variables

Building on the fixed-effects model, we conducted a sensitivity analysis by excluding the variables related to ethnicity (proxied here by region of birth: pct_born_americas, pct_born_africa, pct_born_asia) and religion (pct_muslim, pct_buddhist). Removing these five variables had only a small impact on the model’s overall explanatory power: adjusted \(R^2\) fell from 0.493 to 0.482. Consequently, following the principle of model parsimony, these variables were excluded from the subsequent specifications.

## Number of Boroughs: 32
## 
## Refined Model Results (Ethnicity/Religion Excluded)
## =================================================
##                           Dependent variable:    
##                       ---------------------------
##                             log_crime_rate       
## -------------------------------------------------
## pct_unemployed                 0.030***          
##                                 (0.006)          
##                                                  
## pct_private_rented             0.015***          
##                                 (0.001)          
##                                                  
## pct_20_24                       0.007*           
##                                 (0.003)          
##                                                  
## pct_new_migrant                0.020***          
##                                 (0.002)          
##                                                  
## pct_overcrowded                0.007***          
##                                 (0.002)          
##                                                  
## pct_deprived                   0.033***          
##                                 (0.004)          
##                                                  
## pct_bad_health                 0.074***          
##                                 (0.010)          
##                                                  
## pct_elementary_occup            0.006**          
##                                 (0.003)          
##                                                  
## pct_disabled                    0.013*           
##                                 (0.008)          
##                                                  
## log_job_density                -0.255***         
##                                 (0.010)          
##                                                  
## pct_no_qual                    -0.012***         
##                                 (0.003)          
##                                                  
## pct_hh_with_disabled           -0.014***         
##                                 (0.003)          
##                                                  
## Constant                       5.060***          
##                                 (0.110)          
##                                                  
## -------------------------------------------------
## Borough fixed effects             Yes            
## Observations                     4,988           
## R2                               0.487           
## Adjusted R2                      0.482           
## =================================================
## Note:                 *p<0.1; **p<0.05; ***p<0.01
## 
## --- Variance Inflation Factor (VIF) Check ---
##                              GVIF Df GVIF^(1/(2*Df))
## pct_unemployed           2.963209  1        1.721397
## pct_private_rented       3.937279  1        1.984258
## pct_20_24                2.786139  1        1.669173
## pct_new_migrant          6.352613  1        2.520439
## pct_overcrowded          6.648495  1        2.578468
## pct_deprived            12.583765  1        3.547360
## pct_bad_health           8.086237  1        2.843631
## pct_elementary_occup     6.930283  1        2.632543
## pct_disabled             6.841864  1        2.615696
## log_job_density          1.822892  1        1.350145
## pct_no_qual              8.099530  1        2.845967
## pct_hh_with_disabled     8.326912  1        2.885639
## factor(Derived_Borough)  9.847679 31        1.037580
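
The comparison between a full and a reduced specification can be sketched as follows. This is an illustration on simulated data with hypothetical variable names (`x1`, `x2`, `w1`, `w2`), not the report's code. It compares adjusted R-squared as the text does and, as an additional check that the report itself does not use, runs a partial F-test on the two nested models with `anova()`.

```r
# Simulated data (hypothetical; not the study's dataset)
set.seed(7)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
w1 <- rnorm(n)   # candidate variables to exclude
w2 <- rnorm(n)
y  <- 1 + 0.6 * x1 - 0.4 * x2 + 0.05 * w1 + rnorm(n)

m_full    <- lm(y ~ x1 + x2 + w1 + w2)
m_reduced <- lm(y ~ x1 + x2)

# Adjusted R-squared penalizes the extra parameters of the full model
adj_full    <- summary(m_full)$adj.r.squared
adj_reduced <- summary(m_reduced)$adj.r.squared
cat("Adjusted R2, full:   ", round(adj_full, 3), "\n")
cat("Adjusted R2, reduced:", round(adj_reduced, 3), "\n")

# Partial F-test: do w1 and w2 jointly add explanatory power?
cmp <- anova(m_reduced, m_full)
print(cmp)
```

With a sample as large as the one in this report, an F-test can flag a statistically significant difference even when the change in adjusted R-squared is small, so the two criteria are best read together.
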
# ==============================================================================
# Model 2 correlation matrix check: verifying why Deprivation needs to be dropped
# ==============================================================================

# 1. Load the required packages
if (!require("ggcorrplot")) install.packages("ggcorrplot")
library(tidyverse)
library(ggcorrplot)

# 2. Prepare the data
# Extract all variables included in Model 2 (including Deprivation)
df_model2_corr <- df_clean %>%
  select(
    `Log Crime Rate` = log_crime_rate,
    
    # Core structural variables
    `Unemployment` = pct_unemployed,
    `Deprivation (IMD)` = pct_deprived,      # main variable of interest
    `Overcrowding` = pct_overcrowded,
    `Private Rented` = pct_private_rented,
    
    # Vulnerability and health
    `Bad Health` = pct_bad_health,
    `Disability` = pct_disabled,
    `HH w/ Disabled` = pct_hh_with_disabled,
    
    # Social and demographic
    `Youth (20-24)` = pct_20_24,
    `New Migrant` = pct_new_migrant,
    `No Quals` = pct_no_qual,
    `Elementary Occup` = pct_elementary_occup,
    
    # Environment
    `Log Job Density` = log_job_density
  )

# 3. Compute the correlation matrix
corr_matrix_m2 <- cor(df_model2_corr, use = "complete.obs", method = "pearson")

# 4. Draw the heatmap
p_corr_m2 <- ggcorrplot(corr_matrix_m2,
  method = "square",       # square tiles
  type = "lower",          # show the lower triangle only
  lab = TRUE,              # print the coefficient values
  lab_size = 2.5,          # slightly smaller font because there are many variables
  tl.cex = 10,             # axis label size
  colors = c("#2E9FDF", "white", "#E7B800"), # blue-white-yellow palette
  title = "Correlation Matrix: Model 2 (Highlighting Collinearity)",
  ggtheme = theme_minimal() + 
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
)

# 5. Display and save
print(p_corr_m2)

ggsave("Model2_Correlation_Matrix.png", p_corr_m2, width = 10, height = 10, bg = "white")

Further Model Simplification and Multicollinearity Reduction

To further streamline the model, we removed pct_deprived, the only predictor whose GVIF exceeded 10 in the previous step (12.58). The model was then re-estimated to check the stability and performance of the simplified specification: adjusted \(R^2\) fell only marginally, from 0.482 to 0.475, and the GVIF of every remaining predictor was below 8.

## Number of Boroughs: 32
## 
## =================================================
##                           Dependent variable:    
##                       ---------------------------
##                             log_crime_rate       
## -------------------------------------------------
## pct_unemployed                 0.045***          
##                                 (0.006)          
##                                                  
## pct_private_rented             0.016***          
##                                 (0.001)          
##                                                  
## pct_20_24                        0.005           
##                                 (0.003)          
##                                                  
## pct_new_migrant                0.022***          
##                                 (0.002)          
##                                                  
## pct_overcrowded                0.013***          
##                                 (0.002)          
##                                                  
## pct_bad_health                 0.088***          
##                                 (0.010)          
##                                                  
## pct_elementary_occup           0.010***          
##                                 (0.003)          
##                                                  
## pct_disabled                    0.014*           
##                                 (0.008)          
##                                                  
## log_job_density                -0.250***         
##                                 (0.010)          
##                                                  
## pct_no_qual                    -0.005**          
##                                 (0.003)          
##                                                  
## pct_hh_with_disabled            -0.003           
##                                 (0.003)          
##                                                  
## Constant                       4.802***          
##                                 (0.106)          
##                                                  
## -------------------------------------------------
## Borough fixed effects             Yes            
## Observations                     4,988           
## R2                               0.479           
## Adjusted R2                      0.475           
## =================================================
## Note:                 *p<0.1; **p<0.05; ***p<0.01
## 
## --- Variance Inflation Factor (VIF) Check ---
##                             GVIF Df GVIF^(1/(2*Df))
## pct_unemployed          2.729548  1        1.652134
## pct_private_rented      3.872885  1        1.967965
## pct_20_24               2.773750  1        1.665458
## pct_new_migrant         6.297779  1        2.509538
## pct_overcrowded         5.896427  1        2.428256
## pct_bad_health          7.896560  1        2.810082
## pct_elementary_occup    6.818661  1        2.611257
## pct_disabled            6.839791  1        2.615299
## log_job_density         1.815474  1        1.347395
## pct_no_qual             7.296181  1        2.701144
## pct_hh_with_disabled    6.534689  1        2.556304
## factor(Derived_Borough) 9.296159 31        1.036616
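
How removing a collinear predictor changes the remaining estimates can be illustrated with a small simulation. The sketch below uses made-up variables (`unemp`, `depriv`) as hypothetical stand-ins, not the study's data, and the helper `vif_simple()` is illustrative. It computes the ordinary VIF as 1 / (1 - R^2) from an auxiliary regression and shows the coefficient on the retained variable absorbing part of the dropped variable's effect, the same pattern seen above for pct_unemployed (0.030 before the removal, 0.045 after).

```r
# Simulated data (hypothetical; not the study's dataset)
set.seed(123)
n      <- 2000
unemp  <- rnorm(n, mean = 5, sd = 2)
depriv <- 2 * unemp + rnorm(n, sd = 1.5)    # strongly related to unemp
y      <- 4 + 0.03 * unemp + 0.03 * depriv + rnorm(n, sd = 0.4)

# Ordinary VIF of a predictor: 1 / (1 - R^2) from regressing it on the others
vif_simple <- function(target, others) {
  r2 <- summary(lm(target ~ others))$r.squared
  1 / (1 - r2)
}

vif_unemp_with_depriv <- vif_simple(unemp, depriv)

m_both <- lm(y ~ unemp + depriv)
m_drop <- lm(y ~ unemp)                      # depriv removed

coef(m_both)["unemp"]   # near the simulated 0.03, but imprecisely estimated
coef(m_drop)["unemp"]   # larger: absorbs depriv's effect (about 0.03 + 0.03 * 2)
vif_unemp_with_depriv   # well above 1 because the two predictors overlap
```

The coefficient shift is a reminder that the retained variable now partly proxies for the removed one, so its estimate should be interpreted accordingly.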

Variable Reduction for Optimal Specification

We then screened the remaining predictors and dropped pct_elementary_occup and pct_hh_with_disabled. Adjusted \(R^2\) was essentially unchanged (0.475 to 0.474), while the highest predictor GVIF fell from 7.90 to 7.27.

## Number of Boroughs: 32
## 
## =================================================
##                           Dependent variable:    
##                       ---------------------------
##                             log_crime_rate       
## -------------------------------------------------
## pct_unemployed                 0.048***          
##                                 (0.006)          
##                                                  
## pct_private_rented             0.016***          
##                                 (0.001)          
##                                                  
## pct_20_24                        0.005           
##                                 (0.003)          
##                                                  
## pct_new_migrant                0.022***          
##                                 (0.002)          
##                                                  
## pct_overcrowded                0.016***          
##                                 (0.002)          
##                                                  
## pct_bad_health                 0.085***          
##                                 (0.010)          
##                                                  
## pct_disabled                     0.010           
##                                 (0.007)          
##                                                  
## log_job_density                -0.250***         
##                                 (0.010)          
##                                                  
## pct_no_qual                     -0.002           
##                                 (0.002)          
##                                                  
## Constant                       4.747***          
##                                 (0.096)          
##                                                  
## -------------------------------------------------
## Borough fixed effects             Yes            
## Observations                     4,988           
## R2                               0.478           
## Adjusted R2                      0.474           
## =================================================
## Note:                 *p<0.1; **p<0.05; ***p<0.01
##                             GVIF Df GVIF^(1/(2*Df))
## pct_unemployed          2.626996  1        1.620801
## pct_private_rented      3.387598  1        1.840543
## pct_20_24               2.606428  1        1.614444
## pct_new_migrant         5.982817  1        2.445980
## pct_overcrowded         4.834472  1        2.198743
## pct_bad_health          7.270813  1        2.696445
## pct_disabled            6.450482  1        2.539780
## log_job_density         1.814910  1        1.347186
## pct_no_qual             5.308752  1        2.304073
## factor(Derived_Borough) 6.720201 31        1.031205

Final Model Refinement and Conclusion

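Building on the previous iteration, we removed pct_disabled and pct_no_qual, both of which showed elevated GVIFs (6.45 and 5.31) and were not statistically significant in the previous specification. Adjusted \(R^2\) remained at 0.474, and every GVIF in the resulting model is below 5.2; the GVIF of pct_bad_health in particular fell from 7.27 to 1.89. The resulting seven-predictor model balances parsimony and explanatory power, although pct_20_24 is retained despite not being significant at the 10% level. This step completes the predictor selection process and concludes the regression analysis phase.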

## Number of Boroughs: 32
## 
## =================================================
##                           Dependent variable:    
##                       ---------------------------
##                             log_crime_rate       
## -------------------------------------------------
## pct_unemployed                 0.048***          
##                                 (0.006)          
##                                                  
## pct_private_rented             0.016***          
##                                 (0.001)          
##                                                  
## pct_20_24                        0.004           
##                                 (0.003)          
##                                                  
## pct_new_migrant                0.023***          
##                                 (0.002)          
##                                                  
## pct_overcrowded                0.015***          
##                                 (0.002)          
##                                                  
## pct_bad_health                 0.094***          
##                                 (0.005)          
##                                                  
## log_job_density                -0.250***         
##                                 (0.010)          
##                                                  
## Constant                       4.746***          
##                                 (0.091)          
##                                                  
## -------------------------------------------------
## Borough fixed effects             Yes            
## Observations                     4,988           
## R2                               0.478           
## Adjusted R2                      0.474           
## =================================================
## Note:                 *p<0.1; **p<0.05; ***p<0.01
##                             GVIF Df GVIF^(1/(2*Df))
## pct_unemployed          2.614383  1        1.616905
## pct_private_rented      3.300686  1        1.816779
## pct_20_24               2.465941  1        1.570332
## pct_new_migrant         5.166073  1        2.272900
## pct_overcrowded         3.632450  1        1.905899
## pct_bad_health          1.887181  1        1.373747
## log_job_density         1.780076  1        1.334195
## factor(Derived_Borough) 4.934313 31        1.026080
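
The trade-off between fit and parsimony is what adjusted R-squared encodes. As a worked check, the sketch below applies the standard formula, adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), to figures reported in this section. The helper `adj_r2()` is illustrative, and the count of p = 38 slopes for the final model (7 predictors plus 31 borough dummies) is inferred from the output rather than stated in it.

```r
# Adjusted R-squared from R-squared, sample size n, and number of slopes p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 4988

# Model without fixed effects: 17 predictors, R2 = 0.4696
adj_pooled <- adj_r2(0.4696, n, p = 17)

# Final model: 7 predictors + 31 borough dummies, R2 = 0.478 (rounded)
adj_final <- adj_r2(0.478, n, p = 7 + 31)

round(adj_pooled, 4)  # 0.4678, matching the reported value
round(adj_final, 3)   # 0.474, matching the reported value
```

Because the final model's R-squared is only reported to three decimals, the second figure is an approximate reconstruction.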

Visualizing Relative Explanatory Power

Based on the variables selected for the final model, we generated a horizontal bar chart of standardized coefficients to compare the relative explanatory power of each predictor. Standardizing puts variables measured in different units on a common scale, so the chart shows directly which factors are most strongly associated with the log crime rate.

## Number of boroughs: 32
## 
## =================================================
##                           Dependent variable:    
##                       ---------------------------
##                             log_crime_rate       
## -------------------------------------------------
## pct_unemployed                 0.048***          
##                                 (0.006)          
##                                                  
## pct_private_rented             0.016***          
##                                 (0.001)          
##                                                  
## pct_20_24                        0.004           
##                                 (0.003)          
##                                                  
## pct_new_migrant                0.023***          
##                                 (0.002)          
##                                                  
## pct_overcrowded                0.015***          
##                                 (0.002)          
##                                                  
## pct_bad_health                 0.094***          
##                                 (0.005)          
##                                                  
## log_job_density                -0.250***         
##                                 (0.010)          
##                                                  
## Constant                       4.746***          
##                                 (0.091)          
##                                                  
## -------------------------------------------------
## Borough fixed effects             Yes            
## Observations                     4,988           
## R2                               0.478           
## Adjusted R2                      0.474           
## =================================================
## Note:                 *p<0.1; **p<0.05; ***p<0.01
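
Standardized coefficients of the kind described above can be obtained by rescaling each raw slope by sd(x)/sd(y). The following is a minimal sketch on simulated data with hypothetical predictors (`pct_a`, `log_b`). It is not the report's plotting code, it omits the borough dummies for brevity, and it uses base R `barplot()` where the report may have used a different plotting package.

```r
# Simulated data (hypothetical; not the study's dataset)
set.seed(2021)
n     <- 800
pct_a <- runif(n, 0, 40)          # a percentage-scale predictor
log_b <- rnorm(n, 0, 1)           # a log-scale predictor
y     <- 4 + 0.02 * pct_a - 0.25 * log_b + rnorm(n, sd = 0.4)
toy   <- data.frame(y, pct_a, log_b)

fit <- lm(y ~ pct_a + log_b, data = toy)

# Standardized slope: raw slope * sd(predictor) / sd(outcome)
preds    <- c("pct_a", "log_b")
beta_std <- coef(fit)[preds] * sapply(toy[preds], sd) / sd(toy$y)

# Equivalent route: refit on z-scored variables
fit_z <- lm(y ~ pct_a + log_b, data = as.data.frame(scale(toy)))

# Horizontal bar chart, ordered by absolute size
ord <- order(abs(beta_std))
barplot(beta_std[ord], horiz = TRUE, las = 1,
        xlab = "Standardized coefficient", main = "Relative effect size")
```

The raw slopes here differ by an order of magnitude (0.02 versus -0.25), yet the standardized values are of similar size, which is why the raw coefficients in the regression table cannot be ranked directly.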

# ==============================================================================
# Display and save the marginal effect plots one by one
# ==============================================================================

library(tidyverse)
library(ggeffects)

# 1. Make sure the data and model are set up correctly (avoids the earlier factor() error)
df_clean <- df_clean %>%
  mutate(Derived_Borough = as.factor(Derived_Borough))

vars_top <- c("pct_unemployed", "pct_private_rented", "pct_20_24", 
              "pct_new_migrant", "pct_overcrowded", "pct_bad_health", 
              "log_job_density")

# Use the cleaned formula (no factor() call inside the formula)
formula_clean <- as.formula(
  paste("log_crime_rate ~", paste(vars_top, collapse = " + "), "+ Derived_Borough")
)

model_clean <- lm(formula_clean, data = df_clean)

# 2. Choose the three core variables to inspect
vars_of_interest <- c("pct_unemployed", "pct_private_rented", "log_job_density")

# 3. Loop: generate, display, and save each plot
for (var in vars_of_interest) {
  
  # Compute the marginal effect (adjusted predictions over the range of var)
  eff <- ggpredict(model_clean, terms = var)
  
  # Build the plot
  p <- plot(eff) +
    labs(
      title = paste("Marginal Effect Analysis:", var), # plot title
      y = "Predicted Log Crime Rate",
      x = var
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(face = "bold", size = 14),
      axis.title = element_text(size = 12)
    )
  
  # [Key step] Print each plot to the screen
  print(p)
  
  # [Optional] Save each plot as its own image file
  # The file name is built from the variable name, e.g. "Effect_pct_unemployed.png"
  filename <- paste0("Effect_", var, ".png")
  ggsave(filename, p, width = 6, height = 5, bg = "white")
  
  cat("Saved plot:", filename, "\n")
}

## Saved plot: Effect_pct_unemployed.png

## Saved plot: Effect_pct_private_rented.png

## Saved plot: Effect_log_job_density.png